


NAVAL POSTGRADUATE SCHOOL 
Monterey, California




THESIS 


AN ANALYSIS OF SECURITY BACKGROUND
INVESTIGATION DATA AND THE RELATIONSHIP
WITH SUBSEQUENT DISCHARGE

By 

Edward R. Koucheravy 
September 1988 

Thesis Advisor: Peter A. W. Lewis 



Approved for public release; distribution is unlimited 






REPORT DOCUMENTATION PAGE

Security classification of this page: Unclassified

1a Report Security Classification: Unclassified
1b Restrictive Markings:
2a Security Classification Authority:
2b Declassification/Downgrading Schedule:
3 Distribution/Availability of Report: Approved for public release; distribution is unlimited.
4 Performing Organization Report Number(s):
5 Monitoring Organization Report Number(s):
6a Name of Performing Organization: Naval Postgraduate School
6b Office Symbol (if applicable): 55
6c Address (city, state, and ZIP code): Monterey, CA 93943-5000
7a Name of Monitoring Organization: Naval Postgraduate School
7b Address (city, state, and ZIP code): Monterey, CA 93943-5000
8a Name of Funding/Sponsoring Organization:
8b Office Symbol (if applicable):
8c Address (city, state, and ZIP code):
9 Procurement Instrument Identification Number:
10 Source of Funding Numbers (Program Element No, Project No, Task No, Work Unit Accession No):
11 Title (include security classification): AN ANALYSIS OF SECURITY BACKGROUND INVESTIGATION DATA AND
THE RELATIONSHIP WITH SUBSEQUENT DISCHARGE
12 Personal Author(s): Edward R. Koucheravy
13a Type of Report: Master's Thesis
13b Time Covered: From    To
14 Date of Report (year, month, day): September 1988
15 Page Count: 116
16 Supplementary Notation: The views expressed in this thesis are those of the author and do not reflect the
official policy or position of the Department of Defense or the U.S. Government.
17 Cosati Codes (Field, Group, Subgroup):
18 Subject Terms (continue on reverse if necessary and identify by block number):
Cross-Tabulation, Chi-Square Test, Security Investigation,
19 Abstract (continue on reverse if necessary and identify by block number):

This thesis is concerned with the analysis of security investigation data extracted from the investigative
files of 564 U.S. Navy first-term enlisted personnel who came on active duty between 1979 and 1982. The
individuals had all completed their first term of service and had either completed service satisfactorily or
had been released early with an adverse discharge. The data was selected from six character-of-service
categories: good, homosexual, drug/alcohol abuse, misconduct, court martial, and character and behavior
disorders. The purpose of the thesis was to investigate optimal ways to configure a large, categorical data
base and to look for and quantify relationships between investigative data and final disposition of service.
Several noteworthy relationships were found between derogatory information developed in the investigation
and the subsequent character-of-service. Further avenues of investigation using this data are suggested.

20 Distribution/Availability of Abstract: unclassified/unlimited □ same as report □ DTIC users
21 Abstract Security Classification: Unclassified
22a Name of Responsible Individual: Peter A. W. Lewis
22b Telephone (include Area code): (408) 646-2283
22c Office Symbol: 55Lw

DD FORM 1473, 84 MAR. 83 APR edition may be used until exhausted. All other editions are obsolete.

Security classification of this page: Unclassified



Approved for public release; distribution is unlimited. 

An Analysis of Security Background 
Investigation Data and the Relationship 
With Subsequent Discharge 

by 

Edward R. Koucheravy 
Captain, United States Army 
B.S., United States Military Academy, 1978 

Submitted in partial fulfillment of the 
requirements for the degree of 

MASTER OF SCIENCE IN OPERATIONS RESEARCH 

from the 

NAVAL POSTGRADUATE SCHOOL 
September 1988 



ABSTRACT 



This thesis is concerned with the analysis of security investigation data extracted 
from the investigative files of 564 U.S. Navy first-term enlisted personnel who came on
active duty between 1979 and 1982. The individuals had all completed their first term 
of service and had either completed service satisfactorily or had been released early with 
an adverse discharge. The data was selected from six character-of-service categories: 
good, homosexual, drug/alcohol abuse, misconduct, court martial, and character and 
behavior disorders. The purpose of the thesis was to investigate optimal ways to con- 
figure a large, categorical data base and to look for and quantify relationships between 
investigative data and final disposition of service. Several noteworthy relationships were 
found between derogatory information developed in the investigation and the subse- 
quent character-of-service. Further avenues of investigation using this data are 
suggested. 






TABLE OF CONTENTS 

I. INTRODUCTION 1 

A. BACKGROUND 1 

B. THE SECURITY INVESTIGATION PROCEDURE 1 

C. BACKGROUND OF THE SPECIAL BACKGROUND INVESTIGATION DATA BASE (SBID) 3

D. PURPOSE 5 

E. LIMITATIONS 5 

F. ANALYTICAL TOOLS USED 5 

G. ORGANIZATION OF THESIS 5 

II. DATA REDUCTION 6 

A. GENERAL 6 

B. DATA EDITING 6 

C. DATA REPRESENTATION PROBLEMS 9 

D. RECOMMENDATIONS FOR CODING A LARGE DATA BASE 13 

E. RECOMMENDATIONS FOR IMPLEMENTING A LARGE DATA BASE 13

III. DATA ANALYSIS 14 

A. GENERAL APPROACH FOR THE EXPLORATORY DATA ANALYSIS 14 

B. RESIDUAL ANALYSIS 21 

C. APPLICATION TO THE ACTUAL DATA 23 

D. ANALYSIS OF DEROGATORY INFORMATION 24 

1. General 24 

2. Tests for Differences in Probability Involving Derogatory Information 24

3. Residual Analysis Involving Derogatory Information 25 

4. General Comments About the Derogatory Information Cross-Tabulation 36 

E. ANALYSIS OF RECOMMENDATION DATA 37 

F. FURTHER TESTING 44 

IV. FURTHER ANALYSIS 45 






A. GENERAL 45 

B. ANALYSIS OF PRODUCTIVITY OF SOURCES 45 

C. ANALYSIS OF THE WEIGHTED RECOMMENDATION SCORE 47

V. CONCLUSIONS AND RECOMMENDATIONS 52 

A. FINDINGS 52 

1. Cross Tabulation and Residual Analysis 52

a. General Findings 52 

b. Specific Findings 

2. Productivity of Sources 53 

3. Weighted Recommendation Score as a Predictor 53 

B. RECOMMENDATIONS FOR FURTHER STUDY INVOLVING THIS DATA 54

C. RECOMMENDATIONS FOR FURTHER STUDIES 54 

APPENDIX A. FREQUENCY TABULATIONS OF EACH VARIABLE 55 

APPENDIX B. FOUR-DIGIT DEROGATORY INFORMATION CODES 93

APPENDIX C. BOOTSTRAP RESIDUAL PROGRAM 96 

APPENDIX D. INPUT FILE FOR THE BOOTSTRAP SIMULATION PROGRAM 100



APPENDIX E. OUTPUT FROM THE BOOTSTRAP SIMULATION OF THE 
LARGEST RESIDUAL 

LIST OF REFERENCES 

INITIAL DISTRIBUTION LIST 






I. INTRODUCTION 



A. BACKGROUND 

The importance of protecting sensitive military information and operations from 
potentially hostile sources is a concept as old as warfare itself. Events of the recent past 
indicate that the nation must never grow complacent about its ability to safeguard clas- 
sified information. World-wide defense commitments, the ideological and historic dif- 
ferences existing between the US and other nations, and the huge number of people who 
frequently access, create, analyze and service the vast amount of sensitive information 
combine to create a tremendous managerial problem: Who can be trusted with access 
to the nation's security secrets? 

The need to investigate the backgrounds of those people needing access to classified 
information has been a fixture of the national security establishment for many years. 
Typically, an individual, by virtue of his duty responsibilities, is determined to need reg- 
ular access to sensitive information of some level (secret, top secret, sensitive compart- 
mentalized information, etc). A fairly standard administrative procedure is employed 
throughout the Department of Defense (DOD) in order to determine whether the person 
should be allowed access to classified information.

B. THE SECURITY INVESTIGATION PROCEDURE 

The first element in a security investigation is the completion of a detailed form 
named the Statement of Personal History (SPH). The SPH requires specific information 
about a person's past. Information such as a list of close family members, foreign travel, 
arrests and convictions, schools attended, jobs held, creditors, and personal references 
are all required. The SPH is the starting point for any security investigation. 

The next step in the investigation consists of the National Agency Check (NAC) and 
the Local Agency Check (LAC). Law enforcement agencies, both local (e.g., city or state
police) and national (e.g., the FBI), are queried about outstanding warrants and records
of arrests. A check of credit information is also conducted with national and local credit 
bureaus to determine whether an individual has money problems. 

The clearance will normally be granted to a person who requires access to informa- 
tion with a classification of Secret or lower when the above procedure does not turn up 
any inconsistencies. 






A person requiring access to top secret or higher level information will undergo a 
much more detailed investigation: a background investigation (BI) or a special
background investigation (SBI). These investigations are much more thorough than
those for lesser clearances and involve actual interviews with people who know and have 
developed a relationship with the individual being investigated. Neighbors, friends, 
school officials, former employers and others may be interviewed. If the answers are 
consistent and positive, the subsequent investigation will be much less detailed than if a 
negative trend develops and other sources of information are "developed" by the inves- 
tigators. If information is developed which contradicts that listed on the Statement of 
Personal History or is conspicuously absent from it, the subject will almost certainly 
be interviewed. In certain other types of investigations, an interview is always required. 

The result of this investigation is a dossier containing basic biographical data, de- 
rogatory information obtained from the SPH and other sources (or lack of such infor- 
mation) and recommendations as to the trustworthiness of the subject of the 
investigation. Derogatory information varies from traffic infractions to emotional 
problems to felonies. All the investigative data is gathered for the clearance
determination. An adjudicator reads the investigation file and makes the judgment as to
the award of the clearance.

The last step in the security investigation process is a review of the information ob- 
tained and determination of whether the clearance should be granted. 

Review of the information is performed in accordance with the adjudication guidelines
contained in the DOD Personnel Security Regulation, DOD 5200.2-R, dated January
1987. The factors which can disqualify an individual for a clearance are listed, as well
as the mitigating factors which might allow a clearance to be granted even though
disqualifying factors are present in the information. For example, a person might admit
to experimental use of marijuana (less than six instances of use) in their adolescence. 
This use of cannabis (marijuana or its derivatives) is considered a disqualifying factor. 
A mitigating factor in this instance is that the experimental abuse occurred more than 
six months ago, and the individual has no intention of using cannabis or other drugs in 
the future [Ref. 1]. 

The final determination of clearance for an individual whose record contains dis- 
qualifying information is a subjective one. It is based upon the merits of the case, and 
the evaluation of the adjudicator as to the mitigating factors which hopefully indicate 
the actual reliability of the individual in the future. 






C. BACKGROUND OF THE SPECIAL BACKGROUND INVESTIGATION DATA
BASE (SBID)

It is apparent that the investigation procedure must generate a tremendous amount 
of data about every person who is investigated for a security clearance. It is clear that 
we do not wish to trust national security information to those who are untrustworthy 
enough to violate laws, regulations, and accepted standards of conduct. Could this data 
be used to examine whether data obtained from the security investigations were in any 
way related to the future service record of those investigated? Could this data provide 
insight into the investigation process, allowing investigative resources to be more effi- 
ciently allocated? 

The Defense Personnel Security Research and Education Center (PERSEREC) in
Monterey, California was directed to examine a large sample of data produced from se- 
curity investigations of first-term enlistees entering the Navy during the years 1979 - 
1982. The purpose of the study was to develop insight about the information developed 
in security investigations, especially when the final disposition of service of investigative 
subjects was known. 

The individuals whose records were involved in the study: 

1. Had background investigations initiated within three months of enlistment;

2. Were separated or discharged during, or upon completion of their initial tour of 
duty; 

3. Were discharged for homosexuality, misconduct, drug abuse, court martial, char- 
acter and behavior disorder, or normal completion of enlistment. 

Thus, in the data base, there are five types of unsuitability discharge categories and 
one control group of personnel who successfully completed their term of service. 

Seven hundred records were selected randomly (based upon the last digit of the social
security number) for the study. One hundred cases were selected from each of the
five unsuitability discharge groups, and two hundred cases were selected in which the
individuals were normally separated. Only 564 cases were eventually included in the
study because those cases where the Background Investigation was cancelled for
any reason were removed.

The number of records chosen in each category was not proportional to the
character-of-service category's share of the actual population. An immense number of
records would need to be drawn as a single sample in order to get a large enough
representation from each adverse discharge category. As an illustration, consider that there
are 73 records in this data base from the court martial character-of-service category.
Persons who are investigated receive this adverse character-of-service designation
approximately 0.18% of the time. Simple arithmetic indicates that to get approximately
73 records in this category from a single sample from the investigation population at
large would require a sample size of nearly 41,000. It seems obvious that this is not
reasonable. Table 1 displays the approximate percentages of those initially investigated
who receive each of the six character-of-service designations discussed in this thesis [Ref.
2]. There are other designations which are not considered here.
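
The arithmetic behind the figure of nearly 41,000 can be checked with a short sketch. It assumes simple random sampling, so that the expected number of records in a category equals the sample size times the category proportion. The function name `required_sample_size` is illustrative only and is not part of the software used in this thesis.

```python
import math


def required_sample_size(target_count: int, proportion: float) -> int:
    """Smallest sample size n with n * proportion >= target_count.

    Assumes simple random sampling, so the expected number of records
    in a category is n * proportion.
    """
    if not 0.0 < proportion <= 1.0:
        raise ValueError("proportion must be in (0, 1]")
    return math.ceil(target_count / proportion)


# Court martial category: 73 records at a rate of approximately 0.18%.
n = required_sample_size(73, 0.0018)
print(n)
```

Under these assumptions the result is 40,556 records, which agrees with the "nearly 41,000" quoted above.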



Table 1. CHARACTER OF SERVICE CATEGORY PROPORTIONS

Character of Service Category      % of Investigation Population Receiving Category
Good                               90.4%
Homosexual                         0.92%
Misconduct                         1.2%
Drug/Alcohol Abuse                 1.8%
Court Martial                      0.18%
Character and Behavior Disorder    0.65%



The data base was created by taking the investigation information from microfiche 
and entering it into a Lotus 123 spreadsheet. There were 93 possible entries for each of 
the 564 records resulting in a total data base with the potential for approximately 52,500 
data points. 

The data was essentially categorical in nature with an individual record containing 
personal information ranging from date of birth and military specialty to findings from
high school to type of discharge. A four-digit code representing the type of derogatory
information was the prime means of listing this data and allowed standardization across 
the data base. Other codes were created to represent other pieces of information such 
as the recommendations obtained at the various sources (high schools, colleges, neigh- 
borhoods, etc.), race or marital status. 

Problems with the size of the data base, the slow response of an AT-style micro-computer
when dealing with such a large data set, and the limitations of Lotus 123 in
performing statistical functions allowed only a cursory analysis of the data base as
originally implemented. Clearly another approach was necessary to analyze and obtain
insights from this data.

D. PURPOSE 

The purpose of this thesis is twofold: to investigate some available methods for
organizing and analyzing a large, categorical data base; and to use statistical and
data-analytical techniques to evaluate the personnel security data detailed above in order to
develop insights into, and identify relationships between, the security investigation data
and the subsequent disposition of the subject's term of enlistment.

E. LIMITATIONS 

The data used in this paper was analyzed as provided. It was not possible to ensure
actual random selection of the data; however, we assume that each sample was selected
randomly. The data was selected in an arbitrary manner (one-hundred records from 
each of the unsuitability discharge categories and two-hundred records with normal 
completion of service). It may be difficult to apply the results of this investigation to the 
general population. 

F. ANALYTICAL TOOLS USED 

The data was initially reduced and documented using the Statgraphics (version 2.6) 
statistical software package on a Compaq 286 portable personal computer with two 
megabytes of additional random access memory (RAM). After reduction it was trans- 
ferred to an IBM 3033 System 370 mainframe computer using the MVS batch system. 
On the mainframe computer, Grafstat, an unreleased IBM mainframe data analysis and
statistical package, was used. In addition, APL programs for categorical analysis were
written using APL Graphpak to supplement the routines available in Grafstat. 

G. ORGANIZATION OF THESIS 

Following this introduction, the data reduction techniques used for this thesis and 
the lessons learned from that effort are discussed in Chapter II. The main body of the 
thesis is contained in Chapter III and deals with the data operations and the analysis
conducted. Chapter IV discusses some promising areas for further analysis which were 
only briefly pursued because of time constraints. The closing chapter will summarize the 
results of this research, set forth the conclusions drawn from those results and provide 
recommendations for future research involving this data. 






II. DATA REDUCTION 



A. GENERAL 

PERSEREC experienced problems in attempting to analyze a data base of this 
magnitude. This led them to investigate other methods of configuring the data in order 
to perform the analysis they felt was necessary. Subsequently, the Lotus 123 files were 
exported to the mainframe computer and configured into Conversational Monitoring 
System (CMS) ASCII files. The categorical nature of the data and its overwhelming size 
dictated that documentation and verification of the data base were necessary before any
further useful analysis could be performed. However, the data editors available in CMS 
on the mainframe computer did not offer the ability to easily operate on column fields 
and did not have the flexibility needed to simultaneously document the work performed 
as it proceeded. 

B. DATA EDITING 

Statgraphics (version 2.6) offered a user-friendly data editor with the requisite
capabilities. Unfortunately, it was available only on a personal computer. A Compaq
286 portable AT-compatible micro-computer with two megabytes of additional memory
(useable as a virtual disk) was used. It proved extremely useful; however, its size limited
the amount of data which could be operated upon without exceeding the memory
limitations of the computer (these memory restrictions will be alleviated in the future when
using the new 80386 based machines).

The CMS files were transferred into micro-computer ASCII files and then stored 
on floppy disks and subsequently read into six Statgraphics (ASF) files. Each of the files 
consisted of approximately 15 of the variable entries for each of the 564 records (ap- 
proximately 8400 data points). At any one time six or seven of these variables could be 
operated upon within the data editor. 

A general procedure was followed in formatting and verifying each of the six files.
First, the file was checked to ensure that the data, as it existed on the CMS files, had
been transferred correctly. In one instance half of the field of one variable was truncated 
and had to be reconstructed. 

Next, the numeric coding used for each column was researched and ambiguities
resolved by recoding or removal. This step required considerable research into the coding
methods and the investigation process in order to understand and, if necessary, change
the numeric codes for the sake of clarity.

Finally, a frequency tabulation of each column was performed and labels were
created which corresponded to the coded values. These labels were especially useful later
in the analysis when cross-tabulations between variable vectors were conducted.
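
The frequency tabulation and labelling step can be illustrated with a minimal sketch. The work in this thesis was done in Statgraphics, so the Python below is only an analogue of that step, and the codes and labels shown are hypothetical values chosen for the example rather than the codes actually used in the data base.

```python
from collections import Counter


def frequency_table(column, labels):
    """Return (code, label, count) rows for one coded column, sorted by code."""
    counts = Counter(column)
    return [
        (code, labels.get(code, "UNLABELED"), count)
        for code, count in sorted(counts.items())
    ]


# Hypothetical codes and labels, for illustration only.
sex_labels = {1: "MALE", 2: "FEMALE"}
sex_column = [1, 1, 2, 1, 9]

for code, label, count in frequency_table(sex_column, sex_labels):
    print(f"{code:>4}  {label:<10}  {count}")
```

A code that has no label, such as the 9 in this example, is reported as UNLABELED, which is one simple way to surface the coding ambiguities that had to be researched.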

The procedure discussed above was iterative as sometimes several interpretations 
resulted before one was confirmed as correct. Documentation of the data base was 
conducted throughout these three steps. The list of the variables contained in the data 
base, their purpose and their types are contained in Figure 1 through Figure 3. These
figures are a direct copy of the file management screen that appears in Statgraphics as 
you enter the full-screen editor or view the data directory. Comments are limited to 21 
characters for each variable. 



VARIABLE  WIDTH  TYPE  RANK  LENGTH  DATE     TIME   COMMENT
A         5      I     1     564     3/18/88  11:59  RECORD NO. (RANDOM)
C         3      I     1     564     2/26/88  11:08  SEX (MALE OR FEMALE)
D         8      D     1     564     3/18/88  13:02  BIRTHDATE
F         8      D     1     564     3/18/88  14:01  DATE OF ENTNAC
G         8      D     1     564     3/18/88  14:01  BI REQUEST DATE
I         3      I     1     564     2/26/88  11:08  REASON FOR BI
J         4      I     1     564     2/26/88  11:08  OCCUPATION CODE
K         3      I     1     564     2/26/88  11:10  REASON FOR INTERVIEW
L1        6      I     1     564     2/26/88  12:29  INTERVIEW INFO - 1.
L2        6      I     1     564     2/26/88  12:29  INTERVIEW INFO - 2.
L3        6      I     1     564     2/26/88  12:29  INTERVIEW INFO - 3.
L4        6      I     1     564     2/26/88  12:29  INTERVIEW INFO - 4.
M1        6      I     1     564     2/26/88  12:29  FBI/DCII FINDINGS1
M2        6      I     1     564     2/26/88  12:29  FBI/DCII FINDINGS2
N1        6      I     1     564     2/26/88  14:03  LOCAL AGENCY CHECK
N2        6      I     1     564     2/26/88  14:31  LOCAL AGENCY CHECK
N3        6      I     1     564     2/26/88  14:03  LOCAL AGENCY CHECK
N4        6      I     1     564     2/26/88  14:03  LOCAL AGENCY CHECK
O1        6      I     1     564     2/26/88  14:03  CREDIT BUREAU CHECK
O2        6      I     1     564     2/26/88  14:03  CREDIT BUREAU CHECK
P         4      I     1     564     2/26/88  10:59  H S - # OF SOURCES

Figure 1. List of Variables Contained in the Data Base: Extracted from the
Statgraphics Data Management Screen.






VARIABLE  WIDTH  TYPE  RANK  LENGTH  DATE     TIME   COMMENT
Q1        3      I     1     564     3/4/88   14:32  HIGH SCHOOL RECOMM.
Q2        6      I     1     564     3/4/88   14:32  HIGH SCHOOL RECOMM.
Q3        6      I     1     564     3/4/88   14:32  HIGH SCHOOL RECOMM.
R1        3      I     1     564     2/26/88  15:47  HIGH SCHOOL FINDINGS
R2        5      I     1     564     2/26/88  16:30  HIGH SCHOOL FINDINGS
R3        5      I     1     564     2/26/88  16:30  HIGH SCHOOL FINDINGS
R4        5      I     1     564     2/26/88  16:30  HIGH SCHOOL FINDINGS
S         3      I     1     564     2/26/88  10:59  COLL. - # OF SOURCES
T         4      I     1     564     2/26/88  11:00  COLL. RECOMMENDATION
U         5      I     1     564     2/26/88  11:00  COLLEGE FINDINGS
V         3      I     1     564     2/26/88  11:00  EMPL. # OF SOURCES
W         3      I     1     564     2/26/88  10:53  CO-WORKER # SOURCES
X1        3      I     1     564     3/4/88   11:28  EMPLOYMENT RECOMM.
X2        6      I     1     564     3/4/88   11:15  EMPLOYMENT RECOMM.
X3        6      I     1     564     3/4/88   11:15  EMPLOYMENT RECOMM.
Y1        5      I     1     564     3/4/88   13:36  EMPLOYMENT FINDINGS
Y2        5      I     1     564     3/4/88   13:36  EMPLOYMENT FINDINGS
Y3        5      I     1     564     3/4/88   13:36  EMPLOYMENT FINDINGS
Y4        5      I     1     564     3/4/88   13:36  EMPLOYMENT FINDINGS
Z         2      I     1     564     2/26/88  10:54  NEIGH. # OF SOURCES
AA1       3      I     1     564     3/4/88   12:02  SPH NEIGH. RECOMM.
AA2       3      I     1     564     3/4/88   12:02  DEV. NEIGH. REC.
AA3       6      I     1     564     3/4/88   12:02  DEV. NEIGH. REC.
AB1       5      I     1     564     3/4/88   14:06  NEIGH. FINDINGS
AB2       5      I     1     564     3/4/88   14:06  NEIGH. FINDINGS
AB3       5      I     1     564     3/4/88   14:06  NEIGH. FINDINGS
AC        3      I     1     564     2/26/88  10:56  # OF OTHER SOURCES
AD1       3      I     1     564     3/4/88   15:17  OTHER RECOMM.
AD2       6      I     1     564     3/4/88   15:17  OTHER RECOMM.
AE1       5      I     1     564     3/11/88  08:29  OTHER FINDINGS
AE2       5      I     1     564     3/11/88  08:29  OTHER FINDINGS
AE3       5      I     1     564     3/11/88  08:29  OTHER FINDINGS
AE4       5      I     1     564     3/11/88  08:29  OTHER FINDINGS
AF        2      I     1     564     2/26/88  10:46  RACE

Figure 2. List of Variables Contained in the Data Base (Continued): Extracted
from the Statgraphics Data Management Screen.






VARIABLE  WIDTH  TYPE  RANK  LENGTH  DATE     TIME   COMMENT
AG        2      I     1     564     2/26/88  10:48  MARITAL STATUS
AJ        2      I     1     564     2/26/88  10:48  DEPENDENTS
AN        6      I     1     564     2/26/88  10:48  # OF SIBLINGS
AO        3      I     1     564     2/26/88  10:48  PERMANENT RESIDENCE
AQ        7      I     1     564     3/18/88  14:17  ENLISTMENT DATE
AR        5      I     1     564     2/26/88  10:48  AGE AT ENLISTMENT
AS        4      I     1     564     2/26/88  10:49  MONTHS HS TO ENLIST
AT        4      I     1     564     2/26/88  10:49  # JOBS HS TO ENLIST
AU        3      I     1     564     2/26/88  10:49  # MONTHS UNEMPL.
AV        3      I     1     564     2/26/88  10:49  # MONTHS COLLEGE
AW        3      I     1     564     2/26/88  10:49  MO. UNEMPL. PRIOR ENL
AX1       5      I     1     564     3/11/88  12:54  UNFAV. INFO. ON SPH
AX2       5      I     1     564     3/11/88  12:54  UNFAV. INFO. ON SPH
AX3       5      I     1     564     3/11/88  12:54  UNFAV. INFO. ON SPH
AX4       5      I     1     564     3/11/88  12:54  UNFAV. INFO. ON SPH
AY1       5      I     1     564     3/11/88  11:36  SUMMARY BI
AY2       5      I     1     564     3/11/88  11:36  SUMMARY BI
AY3       5      I     1     564     3/11/88  11:36  SUMMARY BI
AY4       5      I     1     564     3/11/88  11:36  SUMMARY BI
BB               C     2     564 8   3/18/88  14:42
BC        3      I     1     564     2/26/88  10:33  CLEARANCE TYPE
BD               C     2     564 8   3/18/88  14:43  CLEARANCE REV. : DATE
BE               C     2     564 8   3/18/88  14:44  DATE OF SEPERATION
BF        3      I     1     564     2/26/88  10:36  RELEASE CODE
BG1       5      I     1     564     3/11/88  14:06  MILITARY OFFENSES
BG2       5      I     1     564     3/11/88  14:06  MILITARY OFFENSES
BG3       5      I     1     564     3/11/88  14:06  MILITARY OFFENSES
BG4       5      I     1     564     3/11/88  14:06  MILITARY OFFENSES
BH1       5      I     1     564     3/11/88  14:29  REMARKS/DISCHARGE
BH2       5      I     1     564     3/11/88  14:30  REMARKS/DISCHARGE
BH3       5      I     1     564     3/11/88  14:30  REMARKS/DISCHARGE
BH4       5      I     1     564     3/11/88  14:30  REMARKS/DISCHARGE
BL        4      I     1     564     2/26/88  10:08  STATUS OF 5520/20
BM        5      I     1     564     2/26/88  10:08  DISCHARGE CASE CODE
BO        3      I     1     564     2/26/88  10:08  INTERSVC. SEP. CODE
BP        2      I     1     564     2/26/88  10:08  CHARACTER OF SERVICE
BQ        2      I     1     564     2/26/88  10:08  TYPE OF DISCHARGE

Figure 3. List of Variables Contained in the Data Base (Continued): Extracted
from the Statgraphics Data Management Screen.



C. DATA REPRESENTATION PROBLEMS 

Inherent in the verification and documentation of a large data base obtained from 
an outside source are coding inconsistencies. Ideally, thorough documentation of the
codes used and the thought process employed in creating the data base is included with
it. However, this is seldom the case. 

The PERSEREC data base had many inconsistencies along with several strengths. 
A major strength of the data organization was the standardization of most of the coding 
employed. Derogatory information codes (used in 43 of the 93 columns) and recom- 
mendation codes (used in 13 of the columns) were used in a fairly standard manner. The 
numeric code for all derogatory information contained in the data base consisted of a 
standard four-digit code representing 135 different infractions. The infractions
and their codes are listed in Appendix B.

The numeric code used for the types of recommendations obtained from various 
sources consisted of a two-digit integer representing the total number of persons who: 

1. Recommended the subject for a position of trust; 

2. Recommended the subject for a position of trust, with supervision; 

3. Did not recommend the subject for a position of trust; 

4. Declined comment. 

Most sources of derogatory information are represented by several columns in the 
data base. A source is considered a location such as college, high school, employer,
neighborhood, etc. Multiple columns are available for each source category to allow
room for several different types of derogatory information to be displayed, if necessary.
Table 2 shows how the information of columns Y1, Y2, Y3, and Y4 (findings or
derogatory information obtained from employers) was represented:



Table 2. INITIAL REPRESENTATION OF DEROGATORY INFORMATION
(EXAMPLE).

Record Number   Y1     Y2     Y3     Y4
1               9999   9999
2               9999   1071   1106
3               1829   9999   1844   9999
4               1805   1824







After research, these records were interpreted in the following manner: If there are
only 9999 entries in a particular record's entries in Y1 - Y4, then no derogatory
information from the subject's former employers was found. It is possible that no
interview was conducted, although all information indicates that former
employers were visited in almost all instances. If any 9999 entries are contained along
with derogatory information for a particular record, those 9999 codes are meaningless.
The corrected records are shown in Table 3.



Table 3. REPRESENTATION OF DEROGATORY INFORMATION AFTER
REDUCTION (EXAMPLE).

Record Number   Y1     Y2     Y3     Y4
1               9999
2               1071   1106
3               1829   1844
4               1805   1824







In this table, no information was obtained on the person represented by record
number 1. For the second person, the investigator found evidence that the person was
known to lie (1071), and that he was at some time intoxicated in public (1106). The third
person had evidence of vandalism (1829) and malicious mischief (1844). The fourth
person was found to have an incident of reckless driving (1805) and also illegal use of a
firearm (1824).

Columns representing derogatory information obtained from colleges, high schools, 
neighbors, and other sources were similarly reduced. 
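
The reduction from Table 2 to Table 3 can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the procedure actually used (the original work was done with Statgraphics and APL). The function name `reduce_record` and the use of `None` for a blank cell are assumptions made for the example.

```python
NO_DEROG = 9999  # placeholder code described in the text


def reduce_record(codes):
    """Drop placeholders that accompany real codes and left-justify the rest.

    `codes` holds one record's Y1-Y4 entries; None stands for a blank cell
    (an assumption of this sketch).
    """
    real = [c for c in codes if c is not None and c != NO_DEROG]
    if not real:
        # Only placeholders (or blanks) were present: keep a single 9999
        # if the record had one, so "no derogatory information" survives.
        real = [NO_DEROG] if NO_DEROG in codes else []
    # Pad with blanks so every record keeps four columns.
    return real + [None] * (len(codes) - len(real))


# The four example records of Table 2.
table2 = {
    1: [9999, 9999, None, None],
    2: [9999, 1071, 1106, None],
    3: [1829, 9999, 1844, 9999],
    4: [1805, 1824, None, None],
}
table3 = {rec: reduce_record(codes) for rec, codes in table2.items()}
```

Applied to the example records of Table 2, this yields the entries of Table 3.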

As discussed above, the 9999 code used in columns Y1 - Y4 represented "no derog- 
atory information." Research revealed that this interpretation of the 9999 code could 
not be used in some of the other columns. In the security investigation realm, employers 
and neighbors are considered "productive" sources. With that designation, the former
employers and neighbors of a subject are almost always interviewed; thus the 9999 code
for those sources means "no derogatory information." Sources other than employers and
neighbors, on the other hand, are normally visited by an investigator only when he is
fairly certain to obtain derogatory information. The 9999 code in conjunction with these
types of sources means "no interview conducted."
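
A minimal sketch of this source-dependent interpretation follows, assuming hypothetical source labels ("employer", "neighborhood", "college", and so on) that do not appear in the data base itself. It illustrates why the same 9999 code has to be decoded together with the source it came from.

```python
from enum import Enum


class Finding(Enum):
    NO_DEROGATORY_INFORMATION = "no derogatory information"
    NO_INTERVIEW_CONDUCTED = "no interview conducted"
    DEROGATORY_INFORMATION = "derogatory information"


# "Productive" sources per the text; the string labels are hypothetical.
PRODUCTIVE_SOURCES = {"employer", "neighborhood"}


def interpret(source, code):
    """Interpret one derogatory-information entry in light of its source."""
    if code != 9999:
        return Finding.DEROGATORY_INFORMATION
    if source in PRODUCTIVE_SOURCES:
        # These sources are almost always interviewed.
        return Finding.NO_DEROGATORY_INFORMATION
    # Other sources are visited only when derogatory information is expected.
    return Finding.NO_INTERVIEW_CONDUCTED
```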

An even more confusing coding scheme was discovered relating to the recommendations
obtained from the five types of sources outlined above. For the employer, high school,
college, and other sources, a 99 code represents "no interview." The coding for
neighborhood recommendations was different.

Neighborhoods account for many of the developed sources of derogatory information.
A distinction was made between the recommendations of neighbors listed on the subject's
Statement of Personal History (SPH), which are generally positive, and those from
neighborhood sources developed by the investigators. This resulted in four possible
entries for recommendations from a subject's neighborhood. The columns representing
information obtained from the subject's neighborhood are designated AA1-AA4. Column
AA1 represents the recommendations obtained from persons listed on the subject's SPH.
Entries in columns AA2-AA4 are recommendations obtained from neighborhood sources
developed by the investigator. A 99 entry in column AA1 means "no interview conducted,"
while a 99 entry in column AA2 means "no sources developed." A 99 entry in column AA3
or AA4 has no meaning. These fields were repaired by removing all 99 codes from
columns AA3 and AA4.
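
The repair of the neighborhood columns might be sketched as follows. The dictionary layout, the function name `repair_neighborhood`, and the use of `None` for a blank are assumptions of this example; only the column names and the meanings of the 99 code come from the text.

```python
def repair_neighborhood(record):
    """Return a cleaned copy of one record's AA1-AA4 entries, plus notes.

    `record` maps column names to an integer code or None (blank).
    """
    cleaned = dict(record)
    notes = {}
    if cleaned.get("AA1") == 99:
        notes["AA1"] = "no interview conducted"
    if cleaned.get("AA2") == 99:
        notes["AA2"] = "no sources developed"
    for col in ("AA3", "AA4"):
        if cleaned.get(col) == 99:
            cleaned[col] = None  # 99 carries no meaning here, so blank it
    return cleaned, notes


# Hypothetical record in which every neighborhood column holds 99.
cleaned, notes = repair_neighborhood(
    {"AA1": 99, "AA2": 99, "AA3": 99, "AA4": 99}
)
```

The 99 entries in AA1 and AA2 are kept because they carry meaning. Only AA3 and AA4 are blanked.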

Another instance of miscoding occurred in column AN, which represents the number
of siblings of the subject. Throughout the field a character code of "Li" existed along
with the usual integers (1, 2, ...) representing the number of siblings. This code was
thoroughly researched until the only possible explanation was obtained: it represented
"unknown."

The problems highlighted here point to the importance of differentiating, by coding,
even small differences in meaning when implementing codes. Failure to do so risks
losing important distinctions, which may in fact invalidate the data. Another point to
be made is that documentation is essential when data bases are created. Luckily, the
person who performed the data entry was available for reference throughout the data
reduction stage of this project; otherwise much of the information contained in the data
base might have been lost.

Erroneous entries were not commonly found in the data base. Only two erroneous
codes (codes not among the 135 actual derogatory information codes) were found, and
they were in the same column. Research into the underlying records revealed that the
codes had digits transposed, and the corrections were easily made.

Missing values, or blanks, were common in some columns. Care had to be taken 
to preserve these blanks when transferring from one system to another. The Stat- 
graphics representation of blanks as the integer -32768 proved useful in this regard. 
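
One way to preserve blanks during such a transfer is to convert the sentinel to an explicit missing marker on the way in and restore it on the way out. The sketch below assumes a column is a plain list of integers. The function names are illustrative and are not part of Statgraphics.

```python
MISSING = -32768  # integer the text says Statgraphics uses for a blank


def from_statgraphics(column):
    """Replace the sentinel with None so blanks cannot be mistaken for data."""
    return [None if value == MISSING else value for value in column]


def to_statgraphics(column):
    """Restore the sentinel before writing the column back out."""
    return [MISSING if value is None else value for value in column]


# Hypothetical column with two blanks.
raw = [3, MISSING, 0, MISSING, 2]
decoded = from_statgraphics(raw)
```

Converting in both directions returns the original column, so no blank is silently turned into a zero or dropped.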



The records were initially stored in no particular order of record number. This
proved inconvenient when a record had to be cross-checked against its original file.
The use of APL in conjunction with Statgraphics allowed all records to be reorganized
in ascending order and made the file much easier to reference.

Date fields were entered as six-digit codes representing month-day-year. Problems 
were encountered with formatting as Statgraphics requires a slash (/) between the month 
and day and the day and year. A simple APL function was written which performed this 
conversion. 
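
The original conversion was written in APL and is not reproduced in the text. The following Python sketch shows an equivalent transformation under stated assumptions: each date is a six-digit month-day-year value, and a value stored as an integer may have lost its leading zero. The function name `add_slashes` is hypothetical.

```python
def add_slashes(mmddyy):
    """Convert a six-digit month-day-year code to MM/DD/YY form."""
    # Restore a leading zero that is lost when the code is stored as an integer.
    text = str(mmddyy).zfill(6)
    if len(text) != 6 or not text.isdigit():
        raise ValueError(f"not a six-digit date code: {mmddyy!r}")
    return f"{text[:2]}/{text[2:4]}/{text[4:]}"
```

For example, `add_slashes("091588")` gives `"09/15/88"`. The sketch does not check that the month and day are valid calendar values.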

D. RECOMMENDATIONS FOR CODING A LARGE DATA BASE 

1. Care must be taken to differentiate even subtle variations in meaning by using dif- 
ferent codes. 

2. The data base must be designed with the proper analytical tools (software and
hardware) in mind, consistent with the purpose and goals of the analysis.

3. Proper documentation is essential when creating a data base. This is important not
only as a record for the data base creators themselves, but also so that others may
use the data base, possibly long after the creator has finished with it and is no
longer available to answer questions.

4. Design of the data base should be a slow, careful affair. If this stage is neglected, 
the data base designer risks wasting many hours of work and compromising the real 
value of the data base. 

E. RECOMMENDATIONS FOR IMPLEMENTING A LARGE DATA BASE 

Statgraphics has a scrollable data editor which allows the entry, manipulation, and 
review of large data bases. It is convenient, simple to use, and, most importantly, makes 
it easy to correct and manipulate the data when anomalies are detected. 

In view of the value that such a scrollable data editor provided when reducing and 
documenting a data base which is already in existence, here are some recommendations 
for data base design. The design should: 

1. Allow for speedy input of and access to new data; 

2. Allow the data to be manipulated and massaged with scrollable full-screen data 
editors; 

3. Allow easy access by statistical graphics packages such as Grafstat. 



III. DATA ANALYSIS 



A. GENERAL APPROACH FOR THE EXPLORATORY DATA ANALYSIS 

The primary question which this thesis attempts to answer is, "What relationships
exist between the information derived from the subjects' background investigations and
the final disposition of their service?" The answers obtained here will not, of course, be
all-inclusive, but they provide a starting point for further research involving this data
base. Nor is this the only question that can be answered from the data base; as in
much research, other questions and facts become apparent as the research progresses.

Inherent in a data analysis is the initial investigation into the properties and
limitations of the data. The PERSEREC special background investigation data (SBID) is
primarily categorical in nature. The record for each individual contains several different
types of information:

1. Background and biographical information such as age, marital status, reason for 
investigation, etc.; 

2. Derogatory information (or lack thereof) obtained by investigators from various 
sources (high school, neighborhood, employers, etc.); this information may consist 
of crimes, subject admissions, and other matters that reflect on the person's char- 
acter and judgement; 

3. Recommendations from various people associated with these sources as to whether 
they felt that the individual in question should be trusted with a position of trust 
and responsibility; 

4. The result of the term of military service, whether the individual was discharged 
normally, or due to some adverse circumstances. 

The data can be viewed as information obtained prior to the completion of the
investigation (explanatory or independent variables) and information which is the
result of the person's service after the investigation (response or dependent
variables).

Note that the data is basically categorical, e.g., male or female, and thus has no in- 
herent ordering. Thus, while frequency counts can be obtained and are given in Ap- 
pendix A, no distributional measures, e.g., means or variances, can be computed. 
Similarly, dependencies and associations cannot be measured by moments based upon 
joint distributions, e.g., correlation coefficients. 
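
The kind of summary that remains available for categorical data can be illustrated with a short Python sketch. The five records below are invented for the example and are not drawn from the data base. The category labels are likewise hypothetical.

```python
from collections import Counter

# Hypothetical (sex, discharge) pairs; not actual data from the data base.
records = [
    ("male", "normal"),
    ("male", "adverse"),
    ("female", "normal"),
    ("male", "normal"),
    ("female", "normal"),
]

# Frequency count of a single categorical variable.
sex_counts = Counter(sex for sex, _ in records)

# Cross-tabulated counts of an explanatory variable against the response.
cross_counts = Counter(records)
```

Counts such as `sex_counts["male"]` or `cross_counts[("male", "adverse")]` are well defined, whereas a mean or variance of the labels "male" and "female" is not.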

The data thus appears to be ideal for contingency table methods [Ref. 3 : pp. 153 - 
170]. However, note that the one response variable in the contingency table is almost 


